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Abstract 

We propose a new stochastic gradient method for optimizing the sum of a finite set of 
smooth functions, where the sum is strongly convex. While standard stochastic gradient meth- 
ods converge at sublinear rates for this problem, the proposed method incorporates a memory 
of previous gradient values in order to achieve a linear convergence rate. Numerical experiments 
indicate that the new algorithm can dramatically outperform standard algorithms. 



1 Introduction 

A plethora of the problems arising in machine learning involve computing an approximate minimizer 
of a function that sums a loss function over a large number of training examples, where there is a 
large amount of redundancy between examples. The most wildly successful class of algorithms for 



taking advantage of this type of problem structure are stochastic gradient (SG) methods Robbins 



and Momo 1951 Bottou and LeCun 2003 . Although the theory behind SG methods allows them 
to be applied more generally, in the context of machine learning SG methods are typically used to 
solve the problem of optimizing a sample average over a finite training set, 

1 " 

minimize q(x) := — > fAx). (1) 

i=l 

In this work, we focus on such finite training data problems where each fi is smooth and the average 
function g is strongly-convex. 

As an example, consider the case of ^2-i'egularized logistic regression, 

... A 



mmimize — llxlP 

2 n 

4=1 



- Vlog(l + exp(-Mfa:)), (2) 



where G M.P and bi G {—1,1} are the training examples associated with a binary classification 
problem and A is a regularization parameter. This problem can be put in the framework of ([I]) by 



1 



using the choice 



A 



|l.T|l2 + log(l + exp(-&,afa:)). 



More generally, any £2 -regularized empirical risk minimization problem of the form 



minimize 



^1 



1 " 



(3) 



falls in the framework of ([T]) provided that the loss functions h are convex and smooth. An extensive 



list of convex loss functions used in machine learning is given by Teo et al. 2007 , and we can even 



include non-smooth loss functions (or regularizers) by using smooth approximations. While many 
authors use sparsity-inducing regularizers such as the £i-norm which do not automatically yield a 
strongly-convex problem when combined with a convex loss, as discussed by Zou and Hastie 2005 it 



may be beneficial to use an additional £2-regularization term even when considering sparsity-inducing 
regularizers. 

When the number of training examples 11 is large, SG methods are appealing because their iteration 
costs and convergence rates are independent of n. In particular, the basic SG method uses iterations 
of the form 

x''+'=x'-akfi{x''), (4) 

where ak is a step-size and a training example ik is selected uniformly among the set {1, . . . 
The randomly chosen gradient (x'^) yields an unbiased estimator of the true gradient g'{x''), and 
one can show under standard assumptions that for a suitably chosen decreasing step-size sequence 
{ak} the SG iterations achieve the sublinear convergence rate 

ng{x'')]-g{x*) = 0{l/k), 



see 



Nemirovski et al. 2009 §2.1] and [Bach and Moulines [2011 §3.1]. Further, under certain 



assumptions, this rate can be shown to be optimal for strongly-convex optimization in a model of 
computation where the algorithm only accesses the function through unbiased measurements of its 
objective and gradient [see Nemirovski and Yudin 1983 Agarwal et al. 2010 . Thus, we cannot 



hope to obtain a better convergence rate if the algorithm only has access to unbiased measurements 
of the gradient. However, in the case of a finite training set where we may choose a training example 
more than once, the first-order stochastic oracle model is too general and better rates are possible. 

For example, the standard full gradient (FG) method uses iterations of the form 



= x''- aug\x^) = 



n 



i=l 



(5) 



and with a constant step size achieves a linear convergence rate 

g{x')-g{x*) = 0{p''), 



for some p < 1 which depends on the condition number of g [see Nesterov 2004 Theorem 2.1.15 



Linear convergence is also known as geometric or exponential convergence, because the error is cut 
by a fixed fraction on each iteration. Despite the faster convergence rate of the FG method, it can 
be unappealing when n is large because its iterations are n times more expensive than SG iterations. 

In this work we analyze a new algorithm that we call the stochastic average gradient (SAG) method 
a randomized variant of the incremental aggregated gradient (lAG) method of Blatt et al.| [2008 



which combines the low iteration cost of SG methods with a linear convergence rate as in FG 
methods. It uses iterations of the form 



=1 



n 



(6) 
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where at each iteration a random training example i/c is selected and we set 

* [j/i^^^ otherwise. 

That is, like the FG method, the step incorporates a gradient with respect to each training example. 
But, like the SG method, each iteration only computes the gradient with respect to a single training 
example and the cost of the iterations is independent of n. Despite the low cost of the SAG iterations, 
in this paper we show that the SAG iterations have an exponential convergence rate, like the FG 
method. That is, by having access to ik and by keeping a memory of the most recent gradient 
value computed for each training example z, this iteration achieves a faster convergence rate than is 
possible for standard stochastic gradient methods. 

In the next section, we review several closely-related algorithms from the literature, including previ- 
ous attempts to combine the appealing aspects of FG and SG methods. However, note that we are 
not aware of any other SG method that achieves an exponential convergence rate while preserving 
the iteration cost of standard SG methods. Section [s] states the (standard) assumptions underlying 
our analysis and gives the main technical results, while Section |4] discusses practical implementation 
issues and Section [5] presents a numerical comparison of the new method to SG and FG methods. 



2 Related Work 

There are a large variety of approaches available to accelerate the convergence of SG methods, and 
a full review of this immense literature would be outside the scope of this work. Below, we comment 
on the relationships between the new method and several of the most closely-related ideas. 

Momentum; SG methods that incorporate a momentum term use iterations of the form 



X 



Tseng 1998 . It is common to set all Pk = P for some constant (3, and in this case we can re- write 



see 

the SG with momentum method as 

k 



We can re-wr ite the SAG updates ^ in a similar form as 



x 



k 

k+1 fc 



x' -y akS{],i,..k)nix^), (7) 



where the selection function S{i, ii-.k) is set to 1/n if j is the highest iteration where j = ik and is set 
to otherwise. Thus, momentum uses a geometric weighting of previous gradients while the SAG 
iterations select the most recent evaluation of each previous gradient. While momentum can lead to 
improved practical performance, it still requires the use of a decreasing sequence of step sizes and is 
not known to lead to a faster convergence rate. 

Gradient Averaging: Closely related to momentum is using the sample average of all previous 
gradients, 
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which is similar to the SAG iteration in the form ([6|) but w here all previous gradients are used. This 
approach is used in the dual averaging method of Nesterov 2009 , and while this averaging procedure 



leads to convergence for a constant step size and can improve the constants in the convergence 
rate 



Xiao, 2010 , it does not improve on the 0(1/ k) rate. 



Iterate Averaging: Rather than averaging the gradients, some authors propose to use the basic SG 
iteration but use the average of the x*"' over all k as the final estimator. With a suitable choice of step- 
sizes, this gives the same asymptotic efficiency as Newton-like second-order SG methods and also 



leads to increased robustness of the convergence rate to the exact sequence of step sizes [Polyak and 
Baher's method 



Juditsky 1992 



see 



Kushner and Yin 2003 



1.3.4] combines gradient averaging 
with online iterate averaging, and also displays appealing asymptotic properties. However, the 
convergence rates of these averaging methods remain sublinear. 

Stochastic versions of FG methods: Various options are available to accelerate the conver- 
gence of the FG method for smooth functions, such as the accelerated full gradient (AFG) method 
of Nesterov 1983 , as well as classical techniques based on quadratic approximations such as non- 
linear conjugate gradient, quasi-Newton, and Hessian-free Newton methods. Several authors have 



presented stochastic variants of these algorithms Sunehag et al. 2009 Ghadimi and Lan 2010 



Martens 20101 



Under certain conditions these variants are convergent and improve on the constant 

Alternately, if we split the convergence rate into a de- 



in the 0{l/k) rate Sunehag et al. 2009 



terministic and stochastic part, it improves the convergence rate of the deterministic part Ghadimi 



and Lanj 2010| . However, as with all other methods we have discussed thus far in this section, we 



are not aware of any existing method of this flavor that improves on the 0(l/fc) rate. 

Constant step size: If the SG iterations are used with a constant step size (rather than a decreasing 
sequence), then Nedic and Bertsekas 2000 Proposition 2.4] show that the convergence rate of the 



method can be split into two parts. The first part depends on k and converges linearly to 0. The 
second part is independent of k but does not converge to 0. Thus, with a constant step size the 
SG iterations have a linear convergence rate up to some tolerance, and in general after this point 
the iterations do not make further progress. Indeed, convergence of the basic SG method with a 
constant step size has only been shown under extremely strong assumptions about the relationship 
between the functions fi Solodov 1998 . This contrasts with the method we present in this work 



which converges to the optimal solution using a constant step size and does so with a linear rate. 

Accelerated methods: Accelerated SG methods, which despite their name are not related to 
the aforementioned AFG method, take advantage of the fast convergent rate of SG methods with 
a constant step size. In particular, accelerated SG methods use a constant step size by default, 
and only decrease the step size on iterations where the inner-product between successive gradient 
estimates is negative Kesten 1958 , Delyon and Juditsky 1993 . This leads to convergence of the 



method, and allows it to potentially achieve periods of linear convergence where the step size stays 
constant. However, the overall convergence rate of the method remains sublinear. 

Hybrid Methods: Some authors have proposed variants of the SG method for problems of the 
form ([!]) that seek to gradually transform the iterates into the FG method in order to achieve a linear 



convergence rate. Bertsekas 1997 proposes to go through the data cyclically with a specialized 



weighting that allows the method to achieve a linear convergence rate for strongly-convex quadratic 
functions. However, the weighting is numerically unstable and the linear convergence rate presented 
treats full cycles as iterations. Friedlander and Schmidt 2011 consider grouping the fi functions 

In both cases, 



into 'batches' of increasing size and performing SG iterations on the batches, 
iterations that achieve the linear rate have a cost that is not independent of n. 



the 



Incremental Aggregated Gradient: Finally, Blatt et al. 2008 present the most closely-related 



algorithm, the lAG method. This method is identical to the SAG iteration (|6]), but uses a cyclic 
choice of ik rather than sampling the ik values. This distinction has several important consequences. 
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In particular, Blatt et al. are only able to show that the convergence rate is linear for strongly-convex 
quadratic functions (without deriving an explicit rate), and their analysis treats full passes through 
the data as iterations. In contrast, in this work we give an explicit linear convergence rate for the 
SAG iterations for general strongly-convex functions using iterations that only examine a single 
training example. Further, as our analysis and experiments show, when the number of training 
examples is sufficiently large the SAG iterations achieve a linear convergence rate under a much 
larger set of step sizes than the lAG method. This leads to more robustness to the selection of 
the step size and also, if suitably chosen, leads to a faster convergence rate and improved practical 
performance. 



3 Convergence Analysis 



In our analysis we assume that each function fi in ([I]) is differentiable and that each gradient fl is 
Lipschitz-continuous with constant L, meaning that for all x and y in W we have 

\\ni^)-fM\\<L\\x~y\\. 

This is a fairly weak assumption on the fi functions, and in cases where the fi are twice-differentiable 
it is equivalent to saying that the eigenvalues of the Hessians of each are bounded above by L. 
In addition, we also assume that the average function g = - fi strongly-convex with constant 
fj, > 0, meaning that the function x f-7> g{x) — ^||a;|P is convex. This is a stronger assumption 
and is not satisfied by all machine learning models. However, note that in machine learning we are 
typically free to choose the regularizer, and we can always add an ^2-regularization term as in ^ to 
transform any convex problem into a strongly-convex problem (in this case we have fJ- > X). Note 
that strong-convexity implies that the problem is solvable, meaning that there exists some unique x* 
that achieves the optimal function value. Our results assume that we initialize yf to a zero vector for 
all i, and they depend on the quantity cr^ = ^ J^i \\fi{^*)\\'^i which is the variance of the gradients 
at the optimum x* . 

We first consider the convergence rate of the method when using a constant step size of ak — 2^1' 
which is similar to the step size needed for convergence of the lAG method in practice. 

Proposition 1 With a step size of ak — 2KL' SAG iterations satisfy for k>l that 
E[l|.'-.-fl<(l-i^) 



3||a;o 



„* l|2 



9a' 
4Z2 



The proof is given in the Appendix. Note that the SAG iterations also obtain the 0(1/ k) rate of 
SG methods, since 

('-sfe)*^-(-S)^«('/^)> 

albeit with a constant which is proportional to n. But they are advantageous over SG methods in 
later iterations because they obtain a linear convergence rate as in FG methods. We also note that 
a linear convergence rate is obtained for any a ^ 2^ ■ 

With a step size of ak = ^r^, the SAG iterations behave in a similar way to the lAG iterations 



proposed by Blatt et al. 2008 , while n SAG iterations using this step size behave in a similar 
way to an FG method with a step size of 1/2L. Thus, with this small step size, there is not a 
large difference between these three methods. However, our next result shows that, if the number of 
training examples is slightly larger than L/ fi (which will often be the case, as discussed in Section|6|, 
then the SAG iterations can use a much larger step size and obtain a better convergence rate that 
depends on the number of training examples but not on /i or L. 
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Data set 


Training Examples 


Variables 


Reference 


quantum 


50 000 


78 


jCaruana et al. 


12004' 


protein 


145 751 


74 


[Caruana et al. 


2004 


sido 


12 678 


4 932 


IGuyon, 2008 


rcvl 


20 242 


47 236 


1 Lewis et al., 2004 


covertype 


581 012 


54 


Blackard, Jock, and Dean Frank and Asuncion 2010 



Table 1: Binary data sets used in experiments. 



Proposition 2 If ^ ^ ^, with a step size of au = the SA G iterations satisfy for k n that 



¥.[g{x'') - g{x*)\ ^c{\ 

with C 



8n 



(81og(] 



In this result we assume that the first n iterations of the algorithm use stochastic gradient descent 
and that we initialize the subsequent SAG iterations with the average of the iterates, which is why 
we state the result for k ^ n. This leads to an 0(l/fc) rate with a constant that is now only 
proportional to log(n), while if we used the SAG iterations from the beginning we can obtain the 
same convergence rate but the constant is again proportional to n. Also, note that this bound is 
obtained when initializing all yi to zero after the stochastic gradient phasej^ 

Since Proposition 2 is true for all the values of satisfying ^ ^ f ' '^^ '^^^ choose fj, = and 
thus a step size as large as ak = and still get the same convergence rater] Note that we have 
observed in practice that the lAG method with a step size of ak = method with 

a step size of ^ may diverge, even under these assumptions. Thus, for certain problems the SAG 
iterations can use a much larger step size, which leads to increased robustness to the selection of 
the step size. Further, as our analysis and experiments indicate, the ability to use a large step size 
leads to improved performance of the SAG iterations. 

While we have stated Proposition 1 in terms of the iterates and Proposition 2 in terms of the function 
value, note that the rates obtained on iterates and function values are equivalent because, by the 
Lipschitz and strong-convexity assumptions, we have 

'l\\x''-x*r^g{x'^)-g{x*)^^\\x''-x*r- 



4 Implementation and First Pass 

The bound in Proposition [2] assumes that the initialization of the SAG algorithm depends on the 
iterates obtained using n iterations of SG. In practice, this is not ideal as it requires the setting of 
two step sizes, one for the SG algorithm and one for the SAG algorithm. Thus, in our experiments 
we instead run the SAG algorithm from the beginning but we replace the factor n in the step size by 
m, the number of unique ik values we have sampled (which converges to n). This gives the algorithm 
a decreasing sequence of step sizes until all n training examples have been seen, making the initial 

^ While it may appear suboptimal to not use the gradients computed during the n iterations of stochastic gradient 
descent, using them only improves the bound by a constant. 

more precise analysis, however, gives a rate of ^1 — + :^t^'^ for any A* greater than In this setting, 
increasing the step size decreases the convergence rate. 
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iterations very similar to SG iterations. In our experiments this modified SAG algorithm performed 
slightly better than using the SG method for the first iteration. 

A natural way to implement the SAG iteration ([6| is to have a current iterate x, a set of n variables 
yi (initially zero), a search direction d, and to use the pseudo-code 

d^ d-yi, 

d^d + y„ 

X X d. 

m 

Other recursions may be more appealing depending on the structure of the problem. For example, 
for ^2 -regularized objectives like (|3]), it may be more appealing to have the yi variables only represent 
the contribution of the loss li and use 

d^ d-yi, 

d^d + yi, (9) 
1 X a . 



ml m 



5 Experimental Results 

In our experiments we compared the following methods: 

1. FG: The full gradient method described by iteration ([s]). 



2. AFG: The accelerated full gradient method of Nesterov] [1983 , where iterations of ^ are 
interleaved with an extrapolation step. 



3. lAG: The incremental aggregated gradient method of Blatt et al. 2008 described by itera- 
tion Q with a cyclic choice of . 

4. SG: The stochastic gradient method described by iteration we tested both a constant step 
size and a decreasing sequence of step sizes. 

5. SAG: The proposed stochastic average gradient method described by iteration ([6]). 

The theoretical convergence rates suggest the following strategies for deciding on whether to use an 
FG or an SG method: 

1. If we can only afford one pass through the data, then the SG method should be used. 

2. If we can afford to do many passes through the data (say, several hundreds), then an FG 
method should be used. 

We expect that the SAG iterations will be most useful between these two extremes, where we can 
afford to do more than one pass through the data but can not afford to do enough passes to warrant 
using FG algorithms. To test whether this is indeed the case on real data sets, we performed 
experiments on the set of freely available benchmark binary classification data sets listed in Table [T] 
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The quantum and protein data sets were obtained from the KDD Cup 2004 websitej^the sido data 
set was obtained from the Causahty Workbench website]^ while the rcvl and covertype data sets 
were obtained from the LIBSVM Data websiterl We plot in Figurcjljthe results obtained when using 
the i?2-regularized logistic regression problem and in Figure |2| the results obtained when using 



the Huberized support vector machine of Rosset and Zhu 2007 with a parameter of 0.5, which is 



minimize 

xeRp 



1 " 

n ^ — ^ 



with the choice 



for biafx > 1, 

0.75 — biufx for biafx < 0.5, 
^ (1 — biofx)-^ otherwise. 



This is a smooth objective function that is very similar to the objective used in support vector 
machines. 




Effective Passes 



Effecfive Posses 



Figure 1: Comparison of optimization strategies for ^2 -regularized logistic regression. The top row 
gives results on the quantum (left), protein (center), and sido (right) data sets. The bottom row 
gives results on the rcvl (left) and covertype (right) data sets. This figure is best viewed in colour. 



In our experiments we set the regularization parameter A to which is the smallest that would 
typically be used in practice (as discussed in the next section) and which results in difficult op- 
timization problems. We plot the results of the different methods for the first 75 effective passes 
through the data in Figures [l] and [2j For dense problems the runtime of all methods is 0{p) per 
training example, though in practice the actual runtime will depend on the specific computational 
architecture (sparse problems are discussed in the next section). Similar to SG methods, setting the 

' http : //osmot ■ cs ■ Cornell . edu/kddcup | 

^ http : //www . causality . inf . ethz . ch/home .php 

' Kttp : //www. csie .ntu. edu. tw/~cjliii/libsvmtools/datasets 
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Figure 2: Comparison of optimization strategies for ^2-regularized Huberized support vector ma- 
chine. The top row gives results on the quantum (left), protein (center), and sido (right) data sets. 
The bottom row gives results on the rcvl (left) and covertype (right) data sets. This figure is best 
viewed in colour. 



step size in the lAG and SAG algorithms is a difficult issue so we only plot the step size that had 
the lowest function value after 75 passes (among powers of 10), and for the stochastic methods we 
also plot this step size divided by 10 to test the effect of choosing a poor step size. In the plots, 
we define the "best" function value as the lowest function value found across the methods and step 
sizes after 100 passes. 

We can observe several trends across the experiments. First, we note that the SG methods usually do 
substantially better than the full gradient methods (FG and AFG) on the first few passes through the 
data. However, because of their steady progress we also see that the full gradient methods slowly 
catch up to the SG methods and eventually (or will eventually) pass them. The SAG iterations 
seem to achieve the best of both worlds. They start out substantially better than the full gradient 
methods, but continue to make steady (linear) progress which leads to better performance than SG 
methods. Indeed, after 25 passes through the data and beyond, the SAG iterations with a suitably 
chosen step size achieve the lowest error on every data set in the case of the ^2-regularized logistic 
regression and three out of five datasets in the case of the ^2 -regularized Huberized SVM. Further, in 
the case of the ^2-regularizcd logistic regression, the SAG iterations even achieve a lower error than 
all other methods when not using the best step size in all but one case, indicating some robustness 
to this selection. This is in contrast to the SG methods where in many cases not using the best step 
size caused them to have similar or worse performance than the full gradient methods. Finally, we 
note that it is surprising that the randomized SAG method tends to outperform the closely-related 
deterministic lAG method by such a large margin, and we believe that this is due to the larger 
step sizes used by the SAG iterations; the step size that achieved the best performance for the SAG 
iteration was between 10 and 10000 times larger than for the lAG method. 
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6 Discussion 



Step-size selection and termination criteria: The three major disadvantages of SG methods 
are: (i) the slow convergence rate, (ii) choosing the step size while running the algorithm, and (iii) 
deciding when to terminate the algorithm. Addressing the slow convergence rate was the focus of this 
paper, but as with previous work on SG methods we have not provided a fully satisfactory way to 
address choosing the step size. However, the SAG iterations achieve a linear convergence rate for any 
sufficiently small constant step size, which is in contrast to SG methods where misspecifying the step 



sizes has a disastrous effect on the convergence rate, see Nemirovski et al. 2009 §2.1] and Bach and 



Moulines 2011 §3.1]. We also note that the SAG iterations suggest a natural termination criterion; 



since the search direction converges to zero we can use 
optimality of Xk ■ 



^)/a'^|| as an approximation of the 



Mini-batches: Because of the use of vectorization and parallelism in modern architectures, practical 
SG implementations often group training examples into 'mini-batches' and perform SG iterations 
on the mini-batches. We can also use mini-batches within the SAG iterations, and our analysis even 
provides some guidance on the choice of mini-batch size. Specifically, in machine learning problems 
we can often obtain an estimate of L and /x and we can use these to ensure that the number of 
mini-batches is large enough to allow using the larger step-size from Proposition 2. Alternately, in 
cases where n is smaller than fi/L, our analysis suggests somewhat paradoxically that it may be 
advantageous to duplicate training examples in order to increase n. We have performed preliminary 
simulations using this technique, and found that in some cases duplicating training examples and 
then applying the SAG iterations with a large step-size can be beneficial. 

Sparse problems: In their present form, the SAG iterations are unnappealing for sparse problems 
because we need to store the yi variables and the update from to is typically dense. However, 
both of these effects can be lessened by the use of mini-batches, and sparsity in the data will typically 
lead to reduced storage requirements. For example, in recursion ^ the sparsity pattern of the Ui 
variables exactly matches the one of the yi variables. 

Optimal regularization strength: One might wonder if the additional hypothesis in Proposition 
2 is satisfied in practice. In a learning context, where each function fi is the loss associated to a 
single data point, then L is equal to the largest value of the loss second derivative (1 for the square 
loss, 1/4 for the logistic loss) times the uniform bound R on the norm of each data point. Thus, the 
constraint t- ^ f is satisfied wh en A ^ In low-dimensional settings, the optimal regularization 



parameter is of the form C/n [Liang et al. 2009 where C is a scalar constant, and may thus 



violate the constraint. However, the improvement with respect to regularization parameters of the 
form A = G I \fn is known to be asymptotically negligible, and in any case, in such low-dimensional 
settings, regular stochastic or batch gradient descent may be efficient enough in practice. In the 
more interesting high-dimensional settings where the dimension p of our covariates is not small 
compared to the sample size n, then all theoretical analyses we are aware of advocate settings of A 
which satisfy this constraint. For examp le, Sridharan et al. 20 08| co nsider parameters of the form 
A = 



in the parametric setting, while 
in a non-parametric setting. 



Eberts and Steinwart 



2011 



consider A — ^ with /3 < 1 



Algorithm extensions: Our analysis and experiments focused on using a particular gradient ap- 
proximation within the simplest possible gradient method. However, there are a variety of alternative 
gradient methods available. It would be interesting to explore SAG-like versions of AFG methods 
and other classical optimization methods. The latter methods typically achieve better performance 
by adaptively choosing the step size, and it is intriguing to consider whether it would be possible 
to use the previous function value computed for each to adaptively set the step size in SAG-like 
algorithms. Other interesting directions of future work include using non-uniform sampling (such as 
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sampling proportional to the Lipschitz constant of each function) , and exploring variants of the SAG 
iteration that, following 'Agarwal and Duchi 2011 



preserve the fast convergence rates. 



work on large-scale distributed architectures but 
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Appendix: Proofs of the propositions 



We present here the proofs of Propositions [T] and [2] 



A.l Problem set-up and notations 

We use 5 = ^ Y^7=i denote a /x— strongly convex function, where the functions /i, i = 1, . . . , n 
are convex functions from W to M with L-Lipschitz continuous gradients. Let us denote by x* the 
unique minimizer of g. 

For fc 1, the stochastic average gradient algorithm performs the recursion 

n 



where an ik is selected in {1, . . . , n} uniformly at random and we set 



,k-l 



otherwise. 



Denoting a random variable which takes the value 1 — ^ with probability ^ and — ^ otherwise 
(thus with zero expectation), this is equivalent to 



n 



„k-l 



n ' 



with 



fix) 



I fiix) 



V f^{x) 



/ z'll 



Using this definition of z'^, we have E[(z'^)(z''')^] ^ -^I — ^ee^. 
We also use the notation 

( yt\ I fiix*) \ 



V x'^ J 



o(n+l)p 



fn{x*) 



o(n+l)p 



Finally, if M is a x tp matrix and m is a x p matrix, then: 



• diag(M) is the tp x p matrix being the concatenation of the t {p x p)-blocks on the diagonal 
of M; 

• Diag(m) is the tp x tp block-diagonal matrix whose {p x p)-blocks on the diagonal are equal 
to the (p X p)-blocks of m. 
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A. 2 Outline of the proofs 

Each Proposition will be proved in multiple steps. 

1. We shall find a Lyapunov function Q from to M such that the sequence 'E,Q{9^) 
decreases at a linear rate. 

2. We shall prove that Q{9^) dominates Hx*^ — (in the case of Proposition 1) or g{x^)—g{x*) 
(in the case of Proposition 2) by a constant for all k. 

3. In the case of Proposition 2, we show how using one pass of stochastic gradient as the initial- 
ization provides the desired result. 

Throughout the proofs, Tk will denote the tr-field of information up to (and including time fc), i.e., 
Fk is the (T-field generated by z^,. . . ,z''. 



A. 3 Convergence results for stochastic gradient descent 

The constant in both our bounds depends on the initialization chosen. While this does not affect the 
linear convergence of the algorithm, the bound we obtain for the first few passes through the data is 
the 0(l/fc) rate one would get using stochastic gradient descent, but with a constant proportional 
to n. This problem can be alleviated for the second bound by running stochastic gradient descent 
for a few iterations before running the SAG algorithm. In this section, we provide bounds for the 
stochastic gradient descent algorithm which will prove useful for the SAG algorithm. 

The assumptions made in this section about the functions fi and the function g are the same as the 
ones used for SAG. To get initial values for and y", we will do one pass of standard stochastic 
gradient. 

We denote by cr'^ = ^ X]"=i ll/K^^*)!!^ the variance of the gradients at the optimum. We will use the 
following recursion: 

5^- =5^-1-^,4(5^-1) . 



Denoting dk = E||i — x*\\ , we have (following Bach and Moulines 
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5k ^ 4-1 - 27fc(l - 1km [9'{x^-^)'{x^-^ - X*)] + 2jla^ . 

Indeed, we have 

-x*r = iii'^-i - x*r - 2^kfi{i''-Y{i''-' x*)+^i\\fiii'^-')r 

^ p'^-i ~x*r- 2^kfi{i'-'V{i'-' - X*) + 2^ii\\n^{x*)r + 27,^ii4(i'=-i) - 4(x*)ii 

^ - x*f 2^kfiii'-Yii''-' - x*)+2-fl\\fiix*)r 

+ 2^7.^(4(5^-1) - fi{x*)Vii'-' -X*). 

By taking expectations, we get 

E - x*\\'\Tk^,] ^ \\x''~' - x*\\^ - 27feg'(£'=-i)T(i'=-i - x*) + 2-fla' + 2L-flg' (i'^-^ {i'^-' 
E [lli'^ -x*\\'] < E - x*\\'] - 27,(1 - 7feL)E [g' (i'^-y {i''-' - x*)] + 2^la^ 



Thus, if we take 



Ik 



2L + ffc 
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we have 7fe < 27^(1 - 7fcL) and 

Sk < Sk-i - 7fcE [g'{x''-y{x''-' - X*)] + 27^^' 
< Sk-i - Ik 



E \g{x^ ^) — g{x*)\ + ^^k-i + 27^0"^ using the strong convexity of g 













V7fe 
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4-1 + 27fcfT^ 

< - (2L + ^A-) 4 + (2i + ^(fc - 1)) 4-1 + ^jkCT^ . 



Averaging from j = to A; — 1 and using the convexity of 5, we have 

i=0 

fc-1 



=1 

2 k 



i=0 



A; 



7i 



2—1 



2a2 /•'= 




1 


2L 


+ 1t 




1 + 


4L J 



dt 



A.4 Important lemma 



In both proofs, our Lyapunov function contains a quadratic term R{6^) = {O'^ ~ ^*^^ ^ c ) " 

for some values of A, b and c. The lemma below computes the value of R{6^) in terms of elements 
of ^'^-i. 



Lemma 1 If P 



A b 
b'^ c 



E 



6"^ c 

1 



, /or e ]R"PX"P, 6 e ]R"P^P anrf c e M^^p, t/ien 
•?fc-i 



1- - ) ^ + -Diag(diag(^)) 
n / n 



+ ^{fix^-') - f{x*)f Diag(diag(5))(/'(x'=-i) - fix*)) 
+ - f'{x*)V [S - Diag(diag(5))] (/'(x^^-^) - f'{x*)) 



1 



+ 2{l--]{y'^-'-f'{xn) 



b ec 



n 



{x'^-^-x*) 



+^(/'(^'-')-/'(^*)r 

+ - x*)^c{x''-^ - x*) 



ec 

n 



{x''-^-x*) 
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with 



Proof We have 



S be' eb' + -^ece' 

n n n-' 



E 



C 



-r) 



fe-i 



E [(/ - f'{x*)yA{y^ - fix*)) + 2(/ - " ^*) + i^" " x*)'^ cix'^ - x*)\Tk- 



(10) 



The first term (within the expectation) on the right-hand side of Eq. ( 10 ) is equal to 



(y'^- - fix*)) ' Aiy'^ - fix*)) = ( 1 - ^ ) iv"-' ~ fi^l) ' Mv'-' ^ fi^*)) 

+ -^ifix'-') - fix*))'^Aifix'-^) fix*)) 



+ [dmgiz')ifix'-') - y'-')] ' A[diag(z'=)(/'(a;^-i) - y'^-^' 

+ ^ f 1 - -) iy"-' - fix*))'^Aifix''-^) - fix*)) 

n \ n J 



The only random term (given J-k-i) is the third one whose expectation is equal to 
E [[diag(z'=)(/'(a;'=-i) - /-i)]^A[diag(z'=)(/'(x'=-i) - y''-')m~i] 

(/'(x'=-')-y'^')- 



n 



Diag(diag(74)) A 

n 



The second term (within the expectation) on the right-hand side of Eq. ( 10 1 is equal to 

(/ - fix*))'^bix'' ~ X*) - (l - ^) - fix*)Vbix''-^ - X*) 

+ -ifix''-')-fix*))'^bix'^~'-x*) 
n 

iy^-^ - /'(x*))T6eT(z/-i - fix*)) 



n 



n 



-- f 1 - - ) ifix^-') /'(x*))^6e^(/-i - fix*)) 
n n \ n 

f 1 - - ) iv"-' fix*))'^be'^ifix'-') fix*)) 
n n \ n , 

a 1 



-^ifix'^-') - fix*))'^be'^ifix''-') ~ fix*)) 

' k\/fi/k~l\ „.k-l\^'Ti,/k\'T\ifiik-l\ „,k~l\ 



n 



[diag(z'=)(/'(x'=-i) - y'=-i)]^6(2'=)^ [{fix''-') ~ Z"^)] 



The only random term (given J-k-i) is the last one whose expectation is equal to 
E [[diag(z'=)(/'(a;'^-i) ~ /-i)]T5(z'=)^ [(/'(^'-') - /-')] l-^fe-i] 

= ^(/'(a^'-') - /-')^ ('Diag(diag(6e^) - heA ifix''-') - /"i) 
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The last term on the right-hand side ol Eq. ( [lO| is equal to 

{x'' - x*fc{x'' - X*) = - x*fc{x''-^ - X*) 



a 



1 - 



1 



\ n 

2a 



{i/-'^f'ix*))'ece'iy'-'-nx*)) 



+ ^±-{f[x^-^) - /'(x*))Tece^(/'(x'=-i) - fix*)) 



1 



l--]{x'-'-x*yce^{y'-'-nx*)) 
n \ n , 

2a 1 



n n 
2a2 1 



ix'-'-x*)'ce'{f{x'^-')~f{x*)) 



1 - - ) iy'-' ~ nx*)Y ece' inx"^-^) - fix 



The only random term (given Tk-i) is the last one whose expectation is equal to 



E 



Diag(diag(ece )) ece 



Summing all these terms together, we get the following result: 



E 



> ^ C 



+ ^(/'(:^'"') - r{x*)ys{nx^-') ~ nx*)) 



1 



n 



Diag(diag(5)) - -5 



n \ n 



1 



+ 2(l-^)(/-i-/'(x*)) 



T 



ec 



n 

+ {x''~^ -x*yc{x''~^ -X*) 



b ec 

n 



(x' 



k-l 



with 5* = A - 5i6e ' -^^e6' + '^ece' =A-hc-^b^ + ib - '^ec)c-^{b ^ '^ecV 
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Rewriting f'{x''-^) - y''-^ = (/'(x'^^i) - f{x*)) - (y'^-i - fix*)), we have 



fix'-')-y''-Y 



Diag(diag(^)) --S 



inx'-')-y'-') 



inx"^-^) ~ fix*))-' 



Diag(diag(5)) - -S 
n 



-2(/-i-/'(x*)r 



Diag(diag(5)) ~-S 



Diag(diag(5)) -^S 



{fix'-')- fix*)) 

iy'-'-fi^l) 

if'ix'-')-fix*)). 



Hence, the sum may be rewritten as 



E 



ry 



A b 



1 



2 



1 



Diag(diag(5)) 



iy'-' - f'i^l) 



+ -ifix'^-') fix*)) ' Diag(diag(5))(/'(a:'=-i) - fix*)) 
+ liy"'' - rix*)y \S - Diag(diag(5))] if ix'-') - fix*)) 



+ 2(l--)(/-i-/'(x*)r 



ec 

n . 



ix'~'-x*) 



+ ^ifix'-')-fix*))'^ 



+ ix'~'-x*)'cix'-'-x*) 



ec 



ix'-'-x*) 



This concludes the prool. 



A. 5 Analysis for a = 

We now prove Proposition [l] providing a bound for the convergence rate of the SAG algorithm 
the case of a small step size, a — i^^. 

Proof 

Step 1 - Linear convergence of the Lyapunov function 

In this case, our Lyapunov function is quadratic, i.e.. 
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We consider 



A = 3na2/+ — (- - 2)ee^ 
n n 

b = -a(l--)e 
n 

c = I 



S 

ec 

n 



3na^I 



The goal will be to prove that E[Q{e'')\J^k-i] - (1 - S)Q{e''-'^) is negative for some 5 > 0. This 
will be achieved by bounding all the terms by a term depending on g''(a;*~^)^(a;*~^ — a;*) whose 
positivity is guaranteed by the convexity of g. 

We have, with our definition of A, b and c: 

S - Diag(diag(5)) = Sna^I - Sna'^I = 
e^ifix''-') - fix*)) = n[g\x''-') - g'{x*)] = ng'{x'-') . 

This leads to (using the lemma of the previous section): 



E[Q(e'=)|J-fc_i]=E 



1 _ 1^ 3na2(y'=-i _ - fix*)) 



+ {x''-^ - - X*) - ^(^''"' - x*)^e^{f{x''-^) - fix*)) 

+ 3a\nx^-') - fix*))-' {fix^-') - fix*)) 
-2a (l- ^] (/-I - f'{x*))^eix''-' - x*) 



1 



1 _ _ ) 3na\y'-' - fix*)) ' (y'^"^ - fix*)) 

+ (x'^-i - x*)'^ix''-^ - X*) - 2a(a;'^-i - x*)'^ g' ix'"''^) 
+ Sa'ifix"-^) - fix*))^ifix''-') - fix*)) 

-2a (l- ^] (/-I - fix*))'^eix''-' - x*) 



^ ( 1 _ _ ) 3na2(/-i - - fix*)) 



+ ix"-' - X*) ' ix^-' - X*) - 2aix^-^ - x*y^ix^-^) 



"'^-i x*Yix''-^ 
+ ia^nLix^"^ ~ x* fg'ix^-^) 



2a ( 1 - 1 ) - fix*)) ' eix^-^ - x*) . 
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The third hne is obtained using the Lipschitz property of the gradient, that is 

n 

{fix'-') - nx*)y{f{x''') - fix*)) = ^ Wfix'-') - fia 



„*M|2 



i=l 



^^Lifix'^-') - f{x*))^{x>'-' - X*) 



i=l 



nL{g'{x''-')-9'{x*))'^{x'^-'-x* 



We have 



(1 - 5)Q{e''-') = (1 - - r)^ ( fr I) (^''"^ - 



a' (\ 



inoL^I ^ 2 1 ee ' 

n \n 



+ {\-6)(x'-' -x*)^{x'-' -X*) 

- 2a(l -b)(\- ^"j (/-I - f{x*)Ye(x''-' - x*) . 



{y'--nx*)) 



The difference is then: 

E[Q(^'=)| JTfe.i] - (1 - 5)Q{e''-^) 
< (/-I - fix^)'' 



(/-I - fix*)) 



3na2 (5--\l+il-5)—(2-^\ee^ 
\ n J n \ n ' 

+ Six''-' - x*)'^ ix''-' - X*) 

- (2a - 3a2nL)(x'=-i - 

- 2aJ (l - ^ (/-I - /'(x*))^e(a;'=-i - x*). 



Using the fact that, for any negative definite matrix M and for any vectors z and t, we have 



we have, with 



M = 



3nQ2 _ 7 + (1 _ J)^ A - ee^ 
\ nj n \ n J 



z = y 



3na yS 

k-l fi 



(l -— \ + a- 
n I \ n 



3n(5 - 1 - 2(5 + 



(5-1 



ee 
n 



fix* 



t = -2a6 ( 1 - ^ ) eix"-' - x*) 
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(/-1_/'(^*))T 



a 



Znct (5-- /+(l-(5)— 2-- ee 



1 



- 2a5 ( 1 - - ) - /'(a;*)) ' e{x^-' - x*) 



^ 1 



3na^ I S 



I - 



ee 



■ a 



3n5-l-26 + 



6-1 



ee 
n 



T 



e{x^-'-x*) 



aH^ (I - 



a2 [in5~l - 2(5+ '^^^ 



||x'=-i-a;*||2 



[3n(5 - 1 - 2J + ^ 

A sufficient condition for M to be negative definite is to have <5 ^ 
The bound then becomes 

E[Q(^'=)|7-fe_i] - (1 - 5)g(e'=-i) < -(2a - 3a2nL)(a;'=-i - a;*)^.g'(a;'=-i) 

' ,52(1-1^' ' 

V n 



+ u 



[3n(5 - 1 - 2J + 



<5-l- 



I II fe-1 *||2 

n \\x — X 



We now use the strong convexity of g to get the inequahty 

1 



a;*ir< ^{x''-^ -x*)^g'{x''-^) . 



This yields the final bound 

E[Q{e'')\J^k-i] - (1 - < - I 2a - Sa^nL + 



'^'^^ {x'^-'-x*fg'ix''-^). 



Since we know that (a;*^ — x*) (7'(x ) is positive, due to the convexity of g, we need to prove 



that I 2a — 3a nL + 



5^ (1 - 1^ 



[3n(5 - 1 - 2^ + /X /X 
Using <5 = 8^ and a = 2^ gives 



is positive. 





1 




nL 




1 








8nZ 




1 




8nZ 




1 








8nZ 




1 




8nZ 




1 






>0. 



1 -^Mi-^r 



^2n 

P- 



1 _ M 

8 

1^ 



1 
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Hence, 

E[Q(0'=)|J-fe_i]-(l-<5)O(^'=-i)<O. 
We can then take a full expectation on both sides to obtain: 

EQ(6»'=) - (1 - <5)EQ(6i'=-i) < . 

Since Q is a non- negative function (we show below that it dominates a non- negative function), this 
results proves the linear convergence of the sequence EQ{6'') with rate 1 — ^. We have 



Eg(0'=)^ (l-^)'Q(0• 



Step 2 - Domination of Wx'' - x*f by Q(6»'=) 

We now need to prove that Q{6'^) dominates ||a;*^ — a;*||^. If P — ^ |j ^ is positive definite, then 
> \\\x''-x*f. 

We shall use the Schur complement condition for positive definiteness. Since A is positive definite, 
the other condition to verify is |/ — A~^h )- 0. 



3na^ + — - 2a'' 

n in 



3 3n + i - 2 n 



n ee 



^ -I 

3 3n — 2 n 

>- for n > 2 , 



and so P dominates I « 



This yields 



E||a;'=-a;*|r < 3EQ(r) 



We have 

Q(^0)=3na2^||y0-/;(x*)|| 

i 

i 

Initializing all the j/? to 0, we get 



(1 - 2n)a 



2 I (1 - 2n) 
2n3L 



i 



2a (l-^) ixo-xr (E2/.°) 



Q{0°) = 



3a^ 
41,2 



and 
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1 

2nfj, 



A. 6 Analysis for a 
Step 1 - Linear convergence of the Lyapunov function 

We now prove Proposition |2j providing a bound for the convergence rate of the SAG algorithm in 
the case of a small step size, a — 

We shall use the following Lyapunov function: 



r) 



with 



n n 
b = —ve 
c = . 



This yields 



S=^I 



rja a J- 



Diag(diag(5)) 



n n 
n 



S - Diag(diag(S')) = -(ee^ - /) 
n 



1--]S+- Diag(diag(^)) 
n J n 



-ee 



1 (l + 7?)a 



We have 

E[Q(0^)|J-fe_i]-(l-5)Q(0^-i) 

= 2g{x''-^) - 2g{x*) - 2(1 - 5)g (x^-^ + V^'') + 2(1 - 5)g{x*) 



1 2\ a 

1 — ee ' + I 77 

n J n 



T 



1 - - -ee^ 
n I n 



l-(5 — /- 1-5 - l-2zy ee^ 

n I n n n 



--{x''-'-x*Ve'{nx'^-^)~nx*)) 

n 



^^^(/'(:r'='^) - /'(x*))^(/'(x'=-i) - f'{x*)) 



2a 



{y"-' - fix*))-^ - /] if'ix'-') fix*)) 

-S] v{y^-' - f{x*))''e{x^-'-x*). 



Our goal will now be to express all the quantities in terms of [x^ — x*)^ g'{x^ ^) whose positivity 
is guaranteed by the convexity of g. 
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Using the convexity of g, we have 

-2(1 - S)g (x'^-' + < -2(1 - S) \gix'-') + "g'{x''-')e'^y' 

Using the Lipschitz property of the gradients of /j, we have 



„*\ |i2 



(/'(^'-') - f{x*)y{nx'^-') - fix*)) = J2 - 

n 

< 5^ mix"-') - f[{x*)y{x^-' - X*) 



= nL{g'{x''-')-g'{x*)y{x''-'-x*). 



Using e ' f'{x'' ^) - f'{x*)) = ng'{x^ we have 

- — (a;'=-i - x*)^e^(/'(a;'=-i) - f'{x*)) = -2v{x''-^ - x'fg'ix^-^) 

n 

^(j/'^-i - f{x*)Vee'{nx'^'^) - fix*)) = ^(/-^ - f'ixnfeg'ix'-') . 
Reassembling all the terms together, we get 

E[Q{e'')\Tk-i]-{i-s)Q{e''-') 

< 25[g{x''-') - Sg{x*)] + ^ g' {x>'-^)e' y'^-' 

1 - - -ee^ +[V- -/- (1- 5)— /- (1 - 5)-{l 

n I n V n I n n n 



2a 



(/-i-/'(x*))^(/'(x'=-i)-/'(x*)) 



+ 2i^-~Sj 7.(^-1 - /'(ar*))^e(a;'=-i - x*)- 

Using the convexity of g gives 

25[5(x'=-i) - <55(a;*)] ^ 26[x''-' - x^j'^g'ix'^-') , 

and, consequently, 

E[Q{e'^)\j^k-i]-{i-s)Q{e''-') 

< 25[(a;'=-i) - (x*)]Tg'(x'=-i) + ?^g'(x'=-i)e^/-i 



77 

1 - - I -ee^ + ( - ) -/ - (1 - 5)—/ - (1 - 5)-(l 



27/- 



n / 77 \ n I n 

77 



2rv 



+ 2 ( ^ - ^ ) 7/(/-i - /'(ar*))^e(a;'=-i - x^) 
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If we regroup all the terms in [(a;'^ ^) — {x*)] g' {x'' ^) together, and all the terms in {y'^ ^ — f'{x*)) 
together, we get 

E[g(0'=)|j-fe_i]-(i-5)g(0'=-i) 



a , 



r] — 1 



n 



I+(5 - -+ 2u{l -S)]ee 



n 



(/-I - fix*)) 



2v -26- (^^-1 - x*Vg\x'^-') 



n J 

+2(/-i-/v)r 



2 U'ix^-^) - fix*)) + ( ^ - 5)ue{x''-' - X*) + -eg'{x^-') 



n 



Let us rewrite this as 

E[g(e'=)|j-fe_i]-(i-^)Q(^'=-i) 



with 



< (/-I - r{x*)Y {ryjI + Ty/-^^ {y^-' - fix*)) 

+ {y'-' - f{x*)V [TyAfix"-') - fix*)) + Ty^.eix"-' - X*) + Ty,,eg'{x'^-')] 



TyJ 



8rj 



= a (5 - - + - 6) 



'y,9 



n 



2a 



n 



5 U 



25a 
n 



Assuming that Ty^j and Ty^e arc negative, we have by completing the square that 

+ (/-I - /'(x*))^ {Tyjinx^-') - fix*)) + Ty,,e{x''-' - X*) + Ty,,eg'{x^-')) 

< -\ [ryjinx^-') fix*)) + Ty^Mx"-' - X*) + ry,yeg'ix''-Y ^ ^ 



Ty,i 



1 ee' 



(ryjifix'^-') - fix*)) + TyMx'-' - X*) + Ty^.eg'ix"-')) 

^2 



= -Y-^Wf'ix'^-') - fixnf - lTlfn\\g'ix'^-')f 



+ Ty^e TyJ 



1 i|^fe-i_^*||2_ 1 \\g'ix^-')r 



4 TyJ + Ty^, 



4 Ty,/ + Ty,, 



2 ^y,I H~ ^y,e 



2 Ty^/ -|- Ty^e 



2 '^y^I ~\~ '^y^e 
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where we used the fact that {f'{x'^ ^) — f'{x*))^e = g^x'^ After reorganization of the terms, 
we obtain 



E[Q{e')\j^k-i]-{i-s)Q{e''-')^ 



{x"-^ - X*) ' g'{x^-^) 



1 



1 



^^''^'-\\f'{x'^-')-f'{x*r ^ 



1 T-,7..r?^ 



4r, 



yj 



4 T„,/ + r, 



y,e 



We now use the strong convexity of the function to get the following inequalities: 

Wfix''-') - f'ix*)f < Lnix'^'^ - xn^g'ix"-') 
\\x''-^ - x*f < -{x''-^ - x*fg'{x''-^) . 

Finally, we have 

E[Q{e'^)\j^k-i]-{i-s)Q{e''-') 



2(T-y,/ +Ty,e) 



r^yj" 



iTv,f+Tv,g) 



1 



Ln T, 



vJ 



1 ^y.x" 



4 Ty,/ 4/X Tyj + ry,e 



y>e 'y,J 



1 ^y,ff" 



{x''-^ - x*y g' {x''-^) 



If we choose S = ^ with (5^i,i/=^,ry = 2 and a = 1^ , we get 



1 / 2(5 1 \ 1-2(5 
2n/x \ n n n \ '^// 2n^/x \ n j 



'x,g — 1 

V 
1 



1 2(5 3L \ _ 3L 1-25 
n n 2n^ij, j 2n^/i n 



1-5 



25 



Thus, 



\'vJ ^ 'v,g> 



1 <x« 



3L 1-25 
L 



4 TyJ ^jJ, TyJ + Ty^e 



1-6 25-1 r 1 1 {i-Sf 

Ty,I + Ty,e 4 l=2i 4/iry,/+Ty,e 



1-25 



2 2(1-2(5) 



L 2-35 1-25 



n-'ti 1 - 25 



n 



+ 



n Hn'^{TyJ + Ty^, 

1 



(1-5)2 , {l-5){25-l) 



+ 



(1-5)' 



2n 



L 2-2,5 1-25 

"V 1 - 21 
L 2-3? 1-25 



+ 



2-A5 + 2n-2n5 + 25 
1-5 



n-'iJ. 1-25 



2(1 + n) 



L 1-35 1-25 1-5 

+ 



n 



n^l^ 1 - 25 



L 2-35 1-35 



2n 



n-'ti 1 - 25 



2n 



This quantity is negative for 5 < | and £ > — If we choose 5=1, then it is sufficient 

to have f ^ S. 



To finish the proof, we need to prove the positivity of the factor of \\g'{x 



.i/^k-l\\\2 



1 



1 



1 



+ 



1 V 



+ 



|2 _ ^lyJ 

4 7-y,7 



v,i 



n 

~ ^Ty,i 
>0. 



'''y,gi'^'''yJ + '''y,g) 



Then, following the same argument as in the previous section, we have 



= 1- 



8n 



2{g{x°) - g{x*)) + 



with a = - II /i (a;*) II the variance of the gradients at the optimum. 



Step 2 - Domination of g{x'') - g{x*) by Q{6^) 

We now need to prove that Q{0'') dominates g{x^) — g{x*). 
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^ 2g{x^) + — 5'(x'=)^(e^y'=) - 2g{x*) 



|^||.-„f-i(.'-x.r(eV) 



using the convexity of g and the fact that /i (^*) = 

= 2^7(x'=) - 2g{x*) + (—g'{x^) - - x*)\ [e^y") 

\ n n J 



^iieVir+^E 



yf--eV-/;(^*) 



n — 1 



„T„.i|2 



by dropping some terms. 



The quantity on the right-hand side is minimized for e^y = ini^'' ~ ^*) ~ ^9'{^^))- Hence, 
we have 



Q{e^) > 25(2;'=) - 2g{x*) 
= 2g{x^) - 2g{x*) 
> 2g{x'') - 2g{x*) 



2(n+l) 
n^/Lt / 1 



{x" - X*) - —g'{x^) 



2{n + 1) \n'2 
n:^fi / 1 



k *||2 
\x — X " 



n 
4a 

2{n + 1) \^n^ " " ' 



a;'=-a;*||^ + ^||ff'(a; 



using the convexity of g 

^ 2g{x^)-2g{x*) - 



nil 



1 



\x^-x*\? 



2{n + l) 

using the Lipschitz continuity of g' 



^ 25(x'=) - 25(x*) - 

>f^{9{x>^)-9{x*)) 
>\{9{x')-g{x*)). 



n 65 
(n + 1) 64 



(<?(x'=)-5(x*)) 
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We thus get 



= 1- 



8?! 



6n/i_ 



Step 3 - Initialization of a;" using stochastic gradient descent 

During the first few iterations, we obtain the 0(l/fc) rate obtained using stochastic gradient descent, 
but with a constant which is proportional to n. To circumvent this problem, we will first do n 
iterations of stochastic gradient descent to initialize x^, which will be renamed x" to truly reflect 
the number of iterations done. 



Using the bound from section [A. 3[ we have 

ji-i 



Eg 



n ^ 



2L. 



i'\~g{x*)^—\ 



* 
X — X 



'i=0 



1^ 



log 1 



fin\ 
4:L) 



And so, using = ^ J27=o ^^ have for fc ^ n 



E[g(x^)-5(x*)] < 1 



1 

8^ 



14L 
~3^' 



* ||2 

a; — a; ■ 



28ct^ 
3nfi 



log (l 



al) 



Since 



1 - 



8n 



we get 



E{g{x'^)-g{x*)\ ^ [\ - 



8n 



16L| 
3n 



320^ 



log 1 



4L/ 



4^2 



References 

A. Agarwal and J.C. Duchi. Distributed delayed stochastic optimization. Neural Information Pro- 
cessing Systems, 2011. 

A. Agarwal, P.L. Bartlett, P. Ravikumar, and M.J. Wainwright. Information-theoretic lower bounds 
on the oracle complexity of convex optimization. arXiv:1009.0571, 2010. 

F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for 
machine learning. Neural Information Processing Systems, 2011. 

D.P. Bcrtsekas. A new class of incremental gradient methods for least squares problems. SIAM 
Journal on Optimization, 7(4):913-926, 1997. 

D. Blatt, A.O. Hero, and H. Gauchnian. A convergent incremental gradient method with a constant 
step size. SIAM Journal on Optimization, 18(1):29-51, 2008. 

L. Bottou and Y. LeCun. Large scale online learning. Neural Information Processing Systems, 2003. 



28 



R. Caruana, T. Joachims, and L. Backstrom. KDD-cup 2004: results and analysis. ACM SIGKDD 
Newsletter, 6(2):95-108, 2004. 

B. Delyon and A. Juditsky. Accelerated stochastic approximation. SIAM Journal on Optimization, 
3:868-881, 1993. 

M. Eberts and I. Steinwart. Optimal learning rates for least squares SVMs using Gaussian kernels. 
Neural Information Processing Systems, 2011. 

A. Frank and A. Asuncion. UCI machine learning repository, 2010. URL http : / /archive . ics . 
|uci ■ edu/mll 

M.P. Friedlander and M. Schmidt. Hybrid deterministic-stochastic methods for data fitting. 
arXiv:1104.2373, 2011. 

S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic 
composite optimization. Optimization Online, 2010. 

I. Guyon. Sido: A phamacology dataset, 2008. URL |http : / /www . causality . inf . ethz . ch/data/ 
SIDO.html 

H. Kesten. Accelerated stochastic approximation. The Annals of Mathematical Statistics, 29(1): 
41-59, 1958. 

H.J. Kushner and G. Yin. Stochastic approximation algorithms and applications. Springer- Verlag, 
second edition, 2003. 

D.D. Lewis, Y. Yang, T. Rose, and F. Li. RCVl: A new benchmark collection for text categorization 
research. Journal of Machine Learning Research, 5:361-397, 2004. 

P. Liang, F. Bach, and M. Jordan. Asymptotically optimal regularization in smooth parametric 
models. Neural Information Processing Systems, 2009. 

J. Martens. Deep learning via hessian-free optimization. International Conference on Machine 
Learning, 2010. 

A. Nedic and D. Bertsekas. Convergence rate of incremental subgradient algorithms. Stochastic 
Optimization: Algorithms and Applications, pages 263-304, 2000. 

A. Nemirovski and D.B. Yudin. Problem complexity and method efficiency in optimization. Wiley, 
1983. 

A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to 
stochastic programming. SIAM Journal on Optimization, 19(4): 1574-1609, 2009. 

Y. Nesterov. A method for solving a convex programming problem with rate of convergence 0(l/fc^). 
Soviet Math. Doklady, 269(3) :543-547, 1983. 

Y. Nesterov. Introductory lectures on convex optimization: A basic course. Springer, 2004. 

Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical programming, 
120(l):221-259, 2009. 

B. T. Polyak and A.B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM 
Journal on Control and Optimization, 30(4):838-855, 1992. 

H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical 
Statistics, 22(3):400-407, 1951. 



29 



S. Rossct and J. Zhu. Piecewise linear regularized solution paths. The Annals of Statistics, 35(3): 
1012 1030, 2007. 

M.V. Solodov. Incremental gradient algorithms with stepsizes bounded away from zero. Computa- 
tional Optimization and Applications, ll(l):23 -35, 1998. 

K. Sridharan, S. Shalev-Shwartz, and N. Srebro. Fast rates for regularized objectives. Neural 
Information Processing Systems, 2008. 

P. Sunehag, J. Trumpf, SVN Vishwanathan, and N. Schraudolph. Variable metric stochastic ap- 
proximation theory. International Conference on Artificial Intelligence and Statistics, 2009. 

C.H. Too, Q. Le, A.J. Smola, and SVN Vishwanathan. A scalable modular convex solver for regu- 
larized risk minimization. ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 
2007. 

P. Tseng. An incremental gradient(-projection) method with momentum term and adaptive stepsize 
rule. SIAM Journal on Optimization, 8(2):506-531, 1998. 

L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal 
of Machine Learning Research, 11:2543-2596, 2010. 

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal 
Statistical Society: Series B, 67(2):301-320, 2005. 



30 



